Fix checkbox remote running memory stress test (Bugfix) #1167

Merged on Apr 23, 2024 (1 commit)

Conversation

pseudocc (Contributor) commented Apr 10, 2024

Description

When a subprocess of the checkbox-ng service is killed by the OOM killer, systemd stops the whole service because DefaultOOMPolicy=stop applies. As a result, the memory stress test is marked as crashed, or the whole test plan starts over after the service restarts.

This PR explicitly sets OOMPolicy=continue so that the service stays alive and checkbox remote can finish the memory stress test.

Resolved issues

Fixes: #571

systemd[1]: checkbox-ng.service: A process of this unit has been killed by the OOM killer.
kernel: oom_reaper: reaped process 908499 (stress-ng-stack), now anon-rss:15720kB, file-rss:704kB, shmem-rss:0kB
systemd[1]: checkbox-ng.service: State 'final-sigterm' timed out. Killing.
systemd[1]: checkbox-ng.service: Killing process 908487 (stress-ng) with signal SIGKILL.
systemd[1]: checkbox-ng.service: Killing process 908488 (stress-ng-stack) with signal SIGKILL.
systemd[1]: checkbox-ng.service: Killing process 908489 (stress-ng-stack) with signal SIGKILL.

References

systemd[1]: checkbox-ng.service: A process of this unit has been killed by the OOM killer.

This message is emitted by systemd itself (it comes from the systemd source).

OOMPolicy part of man systemd.service:

 OOMPolicy=
     Configure the out-of-memory (OOM) killing policy for the kernel and the
     userspace OOM killer systemd‐oomd.service(8). On Linux, when memory becomes
     scarce to the point that the kernel has trouble allocating memory for
     itself, it might decide to kill a running process in order to free up memory
     and reduce memory pressure. Note that systemd-oomd.service is a more
     flexible solution that aims to prevent out-of-memory situations for the
     userspace too, not just the kernel, by attempting to terminate services
     earlier, before the kernel would have to act.

     This setting takes one of continue, stop or kill. If set to continue and a
     process in the unit is killed by the OOM killer, this is logged but the unit
     continues running. If set to stop the event is logged but the unit is
     terminated cleanly by the service manager. If set to kill and one of the
     unit's processes is killed by the OOM killer the kernel is instructed to
     kill all remaining processes of the unit too, by setting the
     memory.oom.group attribute to 1; also see kernel documentation[2].

     Defaults to the setting DefaultOOMPolicy= in systemd‐system.conf(5) is set
     to, except for units where Delegate= is turned on, where it defaults to
     continue.

     Use the OOMScoreAdjust= setting to configure whether processes of the unit
     shall be considered preferred or less preferred candidates for process
     termination by the Linux OOM killer logic. See systemd.exec(5) for details.

     This setting also applies to systemd‐oomd.service(8). Similarly to the
     kernel OOM kills performed by the kernel, this setting determines the state
     of the unit after systemd-oomd kills a cgroup associated with it.
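
The defaulting rules in the excerpt above can be sketched in a few lines of Python. This is only an illustrative model, not how systemd itself resolves the setting: the function name `effective_oom_policy` and the sample unit text are hypothetical, the manager-wide default is assumed to be `stop` (as described in this PR), and only the boolean forms of `Delegate=` are handled, not controller lists.

```python
import configparser


def effective_oom_policy(unit_text: str, default_policy: str = "stop") -> str:
    """Return the OOM policy a service unit would end up with.

    Simplified model of the man page rules:
    1. an explicit OOMPolicy= in [Service] wins;
    2. otherwise, Delegate= turned on implies "continue";
    3. otherwise, the manager-wide DefaultOOMPolicy= applies.
    """
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep systemd's case-sensitive key names
    parser.read_string(unit_text)

    service = parser["Service"] if parser.has_section("Service") else {}

    explicit = service.get("OOMPolicy")
    if explicit:
        return explicit.strip()

    # Assumption: only boolean spellings of Delegate= are recognised here.
    delegate = service.get("Delegate", "no").strip().lower()
    if delegate in ("yes", "true", "on", "1"):
        return "continue"

    return default_policy


# Hypothetical unit text; the real checkbox-ng.service may look different.
SAMPLE_UNIT = """\
[Service]
ExecStart=/usr/bin/example-agent
OOMPolicy=continue
"""

if __name__ == "__main__":
    print(effective_oom_policy(SAMPLE_UNIT))
```

Without the `OOMPolicy=continue` line, the same unit falls back to the manager-wide default, which is the behaviour this PR works around.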

Tests

Control group: https://certification.canonical.com/hardware/202303-31332/submission/363424/

Experiment group: https://certification.canonical.com/hardware/202303-31332/submission/363564/

The stress-ng stressors were killed, but checkbox-ng.service remained alive. See dmesg-after-memory-stress.txt for details.


codecov bot commented Apr 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 43.09%. Comparing base (be41d70) to head (edc6783).
Report is 5 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1167   +/-   ##
=======================================
  Coverage   43.09%   43.09%           
=======================================
  Files         355      355           
  Lines       38602    38602           
  Branches     6556     6556           
=======================================
  Hits        16634    16634           
  Misses      21302    21302           
  Partials      666      666           
Flag Coverage Δ
checkbox-ng 67.54% <ø> (ø)


pieqq (Collaborator) commented Apr 12, 2024

This looks promising for the Debian-packaged version of Checkbox!

I would like to find a way to replicate this for the snap version too. Sadly, I'm not sure how to do it. I've asked on the Snapcraft forum, so we'll see!

Hook25 (Collaborator) commented Apr 23, 2024

I will merge this as it fixes at least half of the issue. Thank you very much for the contribution; we will try to fix it for snaps as well.

Hook25 merged commit 82d94c7 into main on Apr 23, 2024 (12 checks passed)
Hook25 deleted the OOMPolicy-continue branch on April 23, 2024, 14:04
Linked issue: Checkbox remote will keep start over the whole test plan when running memory stress tests